A flexible model-free prediction-based framework for feature ranking

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but also roots in scientists’ strong interests in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discoveries. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterpart with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property endows NPC good potential in biomedical research where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.


Introduction
From scientific research to industrial applications, practitioners often face the challenge to rank features for a prediction task. Among the ranking tasks performed by scientists and practitioners, a large proportion belongs to marginal ranking; that is, ranking features based on the relation between the response variable and one feature at a time, ignoring other available features. For example, to predict cancer driver genes, biomedical researchers need to first extract predictive features from patients' data. Then they decide whether each extracted feature is informative by examining its marginal distributions in tumor and normal tissues, usually by boxplots and histograms (Davoli et al., 2013;Lyu et al., 2020). Another example is genome-wide association studies, where single nucleotide polymorphisms are ranked by their marginal associations with a phenotype (Buniello et al., 2019).
From a prediction perspective, marginal feature ranking may seem suboptimal, as multiple features usually have dependence and would thus be jointly predictive of the response variable beyond a simple additive manner. Hence, interpretation of all features' importance in a multivariate predictive model is an active area of research; popular criteria include the SHAP value (Lundberg and Lee, 2017) and the feature importance measured by Gini index in the random forest (RF) algorithm. However, such "joint" feature ranking is too computationally intensive when the candidate feature number is huge, as it requires all candidate features to be input into one multivariate model; also, it does not reflect a feature's marginal predictive power when other highly-correlated candidate features are in the model. Moreover, the popularity of marginal feature ranking roots in not only researchers' education backgrounds and discipline conventions but also their strong desire for simple interpretation and visualization in the trial-and-error scientific discovery process. As such, marginal feature ranking has been an indispensable data-analysis step in the scientific community, and it will likely stay popular.
In practice, statistical tests (e.g., two-sample t test and two-sample Wilcoxon rank-sum test) and association measures (e.g., Pearson correlation) are often used to rank features marginally (Davoli et al., 2013;Lyu et al., 2020). However, these tests and association measures do not reflect the objective of a prediction task. For example, if the classification error is of concern, it is unclear how the significance of these tests or the values of these measures are connected to the classification error. This misalignment of ranking criterion and prediction objective is undesirable: the resulting feature rank list does not reflect the marginal importance of each feature for the prediction objective. Hence, scientists and practitioners call for a marginal ranking approach that meets the prediction objective.
In this work, we focus on marginal ranking for binary prediction, which can be formulated as binary classification in machine learning. Binary classification has multiple prediction objectives, which we refer to as paradigms here. These paradigms include (1) the classical paradigm that minimizes the classification error, i.e., a weighted sum of the type I and type II errors where the weights are the class probabilities (Hastie et al., 2009;James et al., 2013), (2) the cost-sensitive learning paradigm that replaces the two error weights by pre-determined costs (Elkan, 2001;Zadrozny et al., 2003), (3) the Neyman-Pearson (NP) et al., 2002;Scott and Nowak, 2005;Tong, 2013;Tong et al., 2018), and (4) the global paradigm that focuses on the overall prediction accuracy under all possible thresholds: the area under the receiver-operating-characteristic curve (AUROC) or the area under the precision-recall curve (AUPRC). Here we consider marginal ranking of features under the classical and NP paradigms, and we define the corresponding ranking criteria as the classical criterion (CC) and the Neyman-Pearson criterion (NPC). To implement CC and NPC, we take a model-free approach by using nonparametric estimates of class-conditional feature densities. This approach makes CC and NPC more adaptive to diverse feature distributions than existing criteria for marginal feature ranking. The idea behind CC and NPC is easily generalizable to the cost-sensitive learning paradigm and the global paradigm.
It is worth highlighting that NPC is robust to sampling bias; that is, even when the class proportions in a sample deviate from those in the population, NPC still achieves feature ranking consistency between sample and population with high probability. This nice property makes NPC particularly useful for disease diagnosis, where a long-standing obstacle is that the proportions of diseased patients and healthy people in medical records do not reflect the proportions in the population.
The rest of the paper is organized as follows. In Section 2, we define the population-level CC and NPC as the oracle criteria under the classical and NP paradigms, respectively. In Section 3, we define the sample-level CC and NPC, and we develop model-free algorithms to implement them. In Section 4, we derive theoretical results regarding the ranking consistency of the sample-level CC and NPC in relation to their population counterparts. In Section 5, we use simulation studies to demonstrate the performance of sample-level CC and NPC in ranking low-dimensional and high-dimensional features. We also implement variants of sample-level CC and NPC based on the support vector machine (SVM) algorithm and show that they are less robust than our proposed sample-level CC and NPC to sampling bias. In Section 6, we apply sample-level CC and NPC to marginal feature ranking in two real datasets. Using the first dataset regarding breast cancer diagnosis, we show that both criteria can identify informative features, many of which have been previously reported; we also provide a Supplementary Excel File for literature evidence. Using the second dataset for prostate cancer diagnosis from urine samples, we demonstrate that NPC is robust to sampling bias. In both simulation and real-data studies, we compare sample-level CC and NPC with joint feature ranking criteria-the SHAP value and the feature importance measures in the RF algorithm-and commonly-used marginal ranking criteria that may give feature ranking misaligned with the prediction objective, including the Pearson correlation, the distance correlation (Székely et al., 2009), 1 the two-sample t test, and the two-sample Wilcoxon rank-sum test. We conclude with a discussion in Section 7. Additional materials and proofs of lemmas, propositions, and theorems are relegated to the Appendix.

Population-level ranking criteria
In this section, we introduce two objective-based marginal feature ranking criteria, on the population level, under the classical paradigm and the Neyman-Pearson (NP) paradigm. As argued previously, when one has a learning/prediction objective, the feature ranking criterion should be in line with that. Concretely, the j-th ranked feature should be the one that achieves the j-th best performance based on that objective. This objective-based feature ranking perspective is extendable to ranking feature subsets (e.g., feature pairs). Although we focus on marginal feature ranking in this work, to cope with this future extension, our notations in the methodology and theory development are compatible with ranking feature subsets.

Notations and classification paradigms
We first introduce essential mathematical notations to facilitate our discussion. Let (X, Y) be a pair of random observations where X ∈ X ⊆ ℝ d is a vector of features and Y ∈ {0, 1} indicates the class label of X. A classifier ϕ: X 0, 1 maps from the feature space to the label space. A loss function assigns a cost to each misclassified instance ϕ(X) ≠ Y, and the risk is defined as the expectation of this loss function with respect to the joint distribution of (X, Y). We adopt in this work a commonly used loss function, the 0-1 loss: 1(ϕ(X) ≠ Y ), where 1( ⋅ ) denotes the indicator function. Let ℙ and E denote the generic probability distribution and expectation, whose meaning depends on specific contexts. With the choice of the 0-1 loss function, the risk is the classification error: , which is aligned with most practitioners' interest in classifier evaluation. Note that in this work, the 0-1 loss is only used as an evaluation criterion in our development of marginal ranking criteria, not as a loss function for training a classifier from data.
In this paper, we call the learning objective of minimizing R(·) the classical paradigm. Under this paradigm, one aims to mimic the classical oracle classifier φ* that minimizes the population-level classification error, imposes a weighting of R 0 and R 1 by π 0 and π 1 . This is not always desirable. For example, when practitioners know the explicit costs for making type I and II errors: c 0 and c 1 , they may want to optimize the criterion c 0 R 0 (·) + c 1 R 1 (·), which is often referred to as the cost-sensitive learning paradigm.
In parallel to the classical paradigm, we consider the NP paradigm, which aims to mimic the level-α NP oracle classifier that minimizes the type II error while constraining the type I error under α, a user-specified type I error upper bound, Usually, α is a small value (e.g., 5% or 10%), reflecting a conservative attitude towards the type I error. As the development of classifiers under the NP paradigm is relatively new, here we review the NP oracle classifier φ α *( ⋅ ). Motivated by the classic NP Lemma (Appendix G) and a correspondence between classification and statistical hypothesis testing, φ α * in (1) can be constructed by thresholding p 1 (·)/p 0 (·) at a proper level C α * (Tong, 2013): where C α * is such that ℙ p 1 (X)/p 0 (X) > C α * | Y = 0 = α, and the estimation of C α * is introduced in Section 3.2.
In addition to the above three paradigms, a common practice is to evaluate a classification algorithm by its AUROC or AUPRC, which we refer to as the global paradigm. In contrast to the above three paradigms that lead to a single classifier, which has its corresponding type I and II errors, the global paradigm evaluates a classification algorithm by aggregating its all possible classifiers with type I errors ranging from zero to one. For example, the oracle AUROC is the area under the curve R 0 φ α * , 1 − R 1 φ α * : α ∈ [0, 1] .

Classical and Neyman-Pearson criteria on the population level
Different learning/prediction objectives in classification induce distinct feature ranking criteria. We first define the population-level CC and NPC. Then we show that these two criteria lead to different rankings of features in general, and that NPC may rank features differently at different α values. We denote by φ A * and φ αA * , respectively, the classical oracle classifier and the level-α NP oracle classifier that only use the features indexed by A ⊆ {1, …, d}. This paper focuses on the case when |A| = 1. Concretely, under the classical paradigm, the classical oracle classifier on index set A, φ A * , achieves where φ A : X ⊆ ℝ d 0, 1 is any mapping that first projects X ∈ ℝ d to its |A|-dimensional where η A x A = E Y | X A = x A is the regression function using only features in the index set A, and p 1A and p 0A denote the class-conditional probability density functions of the features X A . Suppose that candidate feature subsets denoted by A 1 , …, A J are provided, which may be enumerated by a computational algorithm or curated by domain experts. We define the population-level classical criterion (p-CC) of A i as its optimal risk R φ A i * ; i.e., A 1 , …, A J will be ranked based on R φ A 1 * , …, R φ A J * , with the smallest being ranked the top.
The prefix "p" in p-CC indicates "population-level." Note that R φ A i * represents A i 's best achievable performance measure under the classical paradigm and does not depend on any specific models assumed for the distribution of (X, Y).
Under the NP paradigm, the NP oracle classifier defined on the index set A, φ αA * , achieves By the NP Lemma, For a given level α, we define the population-level Neyman-Pearson criterion (p-NPC) of A i as its optimal type II error R 1 φ αA i * ; i.e., A 1 , …, A J will be ranked based on R 1 φ αA 1 * , …, R 1 φ αA J * , with the smallest being ranked the top.
For a graphical illustration of marginal feature ranking by p-CC and p-NPC, please see Figure H.1 in Appendix H. It is worth noting that p-CC and p-NPC do not always give the same feature ranking. For a toy example, we compare two features X 1 , X 2 ∈ ℝ, 2 whose class-conditional distributions are the following Gaussians: and the class priors are equal, i.e., π 0 = π 1 = .5. It can be calculated that R φ 1 * = . 106 and R φ 2 * = . 113. Therefore, R φ 1 * < R φ 2 * , and p-CC ranks feature 1 higher than feature 2. The comparison is more subtle for p-NPC. If we set α = .01, R 1 φ α 1 larger than R 1 φ α 2 * = . 299. However, if we set α = .20, R 1 φ α 1 * = . 049 is smaller than R 1 φ α 2 * = . 084. Figure 1 illustrates the NP oracle classifiers for these α's.
This example suggests a general phenomenon that feature ranking depends on the userchosen criteria. For some α values (e.g., α = .20 in the example), p-NPC and p-CC agree on the ranking, while for others (e.g., α = .01 in the example), they disagree. 3 This observation calls for the development of sample-level CC and NPC.

Sample-level ranking criteria
In the following text, we refer to sample-level CC and NPC as "s-CC" and "s-NPC" respectively. In the same model-free spirit of the p-CC and p-NPC definitions, we use model-free nonparametric techniques to construct s-CC and s-NPC. Admittedly, such construction would be impractical when the feature subsets to be ranked have large cardinalities. However, since we are mainly interested in marginal feature ranking, with intended extension to small subsets such as feature pairs, model-free nonparametric techniques are appropriate.
In the methodology and theory sections, we assume the following sampling scheme.
Suppose we have a training dataset S = S 0 ∪ S 1 , where S 0 = X 1 0 , …, X m 0 are independent and identically distributed (i.i.d.) class 0 observations, S 1 = X 1 1 , …, X n 1 are i.i.d. class 1 observations, and S 0 is independent of S 1 . The sample sizes m and n are considered as fixed positive integers. The construction of both s-CC and s-NPC involves splitting the class 0 and class 1 observations. To increase stability, we perform multiple random splits.
As in the definitions of p-CC and p-NPC, we use notations to allow for the extension to ranking feature subsets. For A ⊆ {1, …, d} with |A| = l, recall the classical oracle classifier restricted to A, φ A * (x), defined in (3) and the NP oracle classifier restricted to A, φ αA * (x), defined in (5). Although these two oracles have different thresholds, π 0 /π 1 vs. C αA * , the class-conditional density ratio p 1A (·)/p 0A (·) is involved in both oracles. The densities p 0A and p 1A can be estimated respectively from S ts 0(b) and S ts 1(b) by kernel density estimators where ℎ m 1 and ℎ n 1 denote the bandwidths, and K(·) is a kernel in ℝ l .

Sample-level classical ranking criterion
To define s-CC, we first construct plug-in classifiers ϕ A In each classifier, the threshold m 1 /n 1 mimics π 0 /π 1 . If classes 0 and 1 are sampled with probabilities π 0 and π 1 , respectively, then each classifier (3). However, in the presence of sampling bias, m 1 /n 1 cannot mimic π 0 /π 1 , and thus ϕ A Armed with the classifiers ϕ A (1) ( ⋅ ), …, ϕ A (B) ( ⋅ ), we define the sample-level CC (s-CC) of A as In other words, CC A is the average of the risks of ϕ A (b) ( ⋅ ) on the left-out observations

Sample-level Neyman-Pearson ranking criterion
To define s-NPC, we use the same kernel density estimates to plug in p 1A (·)/p 0A (·), as in s-CC. To estimate the oracle threshold C αA * , we use the NP umbrella algorithm (Tong et al., 2018). Unlike s-CC, in which we use both S lo 0(b) and S lo 1(b) to evaluate the constructed classifier, for s-NPC we use S lo 0(b) to estimate the threshold and only S lo 1(b) to evaluate the classifier.
The NP umbrella algorithm finds proper thresholds for all scoring-type classification methods (e.g., nonparametric density ratio plug-in, logistic regression, and RF) so that the resulting classifiers achieve a high probability control on the type I error under the pre-specified level α. A scoring-type classification method outputs a scoring function that maps the feature space X to ℝ, and a classifier is constructed by combining the scoring function with a threshold. To construct an NP classifier given a scoring-type classification method, the NP umbrella algorithm first trains a scoring function s A .
In this work, we specifically use s A ( ⋅ ) to S lo 0(b) to obtain scores T i (b) = s A (b) X m 1 + i 0(b) , i = 1, …, m 2 , which are then sorted in an increasing order and denoted by T (i) (b) , i = 1, …, m 2 . Third, for a user-specified type I error upper bound α ∈ (0, 1) and a violation rate δ 1 ∈ (0, 1), which refers to the probability that the type I error of the trained classifier exceeds α, the algorithm chooses the order k* = min k = 1, …, m 2 k: ∑ j = k m 2 m 2 j (1 − α) j α m 2 − j ≤ δ 1 .
When m 2 ≥ logδ 1 log(1 − α) , a finite k* exists, 4 and the umbrella algorithm chooses the threshold of the estimated scoring function as C αA . Hence, the resulting NP classifier is Proposition 1 in Tong et al. (2018) states that there is no more than δ 1 probability for the type I error of ϕ αA (b) ( ⋅ ) to exceed α: for every b = 1, …, B. We evaluate the type II error of the B NP classifiers ϕ αA where s A is the kernel density ratios constructed on S ts 0(b) ∪ S ts 1(b) using only the features indexed by A, and C αA The time complexity is O ((m 1 + n 1 ) · (m 2 + n 2 )) for calculating s-CC, and an additional complexity of O (m 2 log m 2 ) is needed for calculating s-NPC; both time complexities can be reduced to O(m + n) if approximate kernel density estimation is used and m 2 is bounded. We discuss the calculation details of the time complexities in Appendix E, and we illustrate the calculation of s-CC and s-NPC in Figure H.2.
For the implementation of s-NPC and s-CC, we use the kde() function with default arguments in the R package ks. By default, the function uses the Gaussian kernel and the bandwidth selected by the univariate plug-in selector of (Wand and Jones, 1994).

Revisiting the toy example at the sample level
With the above definitions of s-CC and s-NPC, we demonstrate how they rank the two features in the toy example ( Figure 1) and that their ranks are consistent with their population-level counterparts p-CC and p-NPC, respectively, with high probability.
We simulate 1000 samples, each of size N = 2000 (two classes combined), from the two-feature distribution (6) in the toy example ( Figure 1). With B = 11, we apply s-CC (8) and s-NPC with δ 1 = .05 (11) to each sample to rank the two features, and we calculate the frequency of each feature being ranked the top among the 1000 ranking results. Table  1 shows that s-NPC (α = .01) ranks feature 2 the top with high probability, while s-CC and s-NPC (α = .20) prefer feature 1 with high probability. This is consistent with our population-level result in Section 2.2: p-NPC (α = .01) prefers feature 2, while p-CC and p-NPC (α = .20) find feature 1 better.

Theoretical properties
In this section, we investigate the ranking properties of s-CC and s-NPC. Concretely, we will address this question: for J candidate feature index sets A 1 , …, A J of size l, is it guaranteed that s-CC and s-NPC have ranking agreements with p-CC and p-NPC respectively with high probability? In our theory development, we consider J as a fixed number; for simplicity, we assume the number of random splits to be B = 1 in s-CC and s-NPC, thus removing the super index (b) in all notations in this section and the Appendix proofs.
In addition to investigating ranking consistency, we discover a property unique to s-NPC: the robustness against sampling bias. Concretely, as long as the sample sizes m and n are large enough, s-NPC gives ranking consistent with p-NPC even when the class size ratio m/n in the sample is far from the ratio π 0 /π 1 in the population. In contrast, s-CC is not robust against sampling bias, except when the population class proportion ratio π 0 /π 1 is known and we replace the thresholds in the plug-in classifiers ϕ A (1) ( ⋅ ), …, ϕ A (B) ( ⋅ ), which are used for s-CC, by this ratio.
One example of β-valid kernels is the product kernel whose ingredients are kernels of order β in 1 dimension: where K 1 (·) is a 1-dimensional β-valid kernel and is constructed based on Legendre polynomials. Such kernels have been considered in Rigollet and Vert (2009). When a β-valid kernel is constructed out of Legendre polynomials, it is also Lipschitz and bounded. For simplicity, we assume that all the β-valid kernels considered in the theory discussion are constructed from Legendre polynomials.
The above condition for density functions was first introduced in Polonik (1995), and its counterpart in the classical binary classification was called the margin condition (Mammen and Tsybakov, 1999), which is a low noise condition. Recall that the set {x : η(x) = 1/2} is the decision boundary of the classical oracle classifier, and the margin condition in the classical paradigm is a special case of Definition 4 by taking f(·) = η(·) and C = 1/2. Unlike the classical paradigm where the optimal threshold 1/2 on regression function η(·) is known, the optimal threshold under the NP paradigm is unknown and needs to be estimated, thus suggesting the necessity of having sufficient data around the decision boundary to detect it. This concern motivated Tong (2013) to formulate a detection condition that works as an opposite force to the margin assumption, and Zhao et al. (2016) improved the condition and proved the condition's necessity in bounding the excess type II error of an NP classifier. To establish ranking consistency properties of s-NPC, a bound on the excess type II error is an intermediate result, so we also need this detection condition for our current work.

Definition 5 (Detection condition (Zhao et al., 2016))-A function f(·) satisfies
the detection condition of the order γ at the level (C, δ*) with respect to the probability distribution P of a random vector X, if there exists a positive constant C, such that for all δ ∈ (0, δ*),

Ranking property of s-CC
To study the ranking agreement between s-CC and p-CC, an essential step is to develop a concentration result between CC A and R φ A * , where φ A * was defined in (3), based on Proposition 6.
Under smoothness, regularity, and sample size conditions, Theorem 7 shows the concentration of CC A around R φ A * with probability at least 1 − (δ 3 +δ 4 +δ 5 ). The userspecified violation rate δ 3 accounts for the uncertainty in training the scoring function s A ( ⋅ ) on a finite sample; δ 4 represents the uncertainty of using left-out class 1 observations S lo 1 to estimate R 1 ϕ A ; δ 5 represents the uncertainty of using left-out class 0 observations S lo 0 to estimate R 0 ϕ A (Recall that R(·) = π 0 R 0 (·) + π 1 R 1 (·)). Like the constant C 0 in Proposition 6, the generic constant C 1 in Theorem 7 can be provided more explicitly, but it would be too cumbersome to do so. More discussion about Theorem 7 can be found after its counterpart for NPC, i.e., Theorem 12.
Theorem 7 leads to the ranking consistency of s-CC.
Li et al.
When m 1 /n 1 is far from π 0 /π 1 , clearly the classification errors of these two classifiers would not be close, so we would not have CC A − R φ A * small. As Theorem 7 is a cornerstone to ranking consistency between s-CC and p-CC, we conclude that the classical criterion is not robust to sampling bias.

Ranking property of s-NPC
To establish ranking agreement between s-NPC and p-NPC, an essential step is to develop a concentration result of NPC αA around R 1 φ αA * , where φ αA * is defined in (4). Recall that ϕ αA (x) = 1 s A x A > C αA = 1 p 0A x A /p 1A x A > C αA , where C αA is determined by the NP umbrella classification algorithm. We always assume that the cumulative distribution function of s A X A , where X ~ P 0 , is continuous.
, then the classifier ϕ αA satisfies with probability at least 1 − δ 1 − δ 2 , where and ⌈z⌉ denotes the smallest integer larger than or equal to z. Moreover, if The next proposition is a result of Lemma 10 and a minor modification to the proof of Proposition 2.4 in Zhao et al. (2016). We can derive an upper bound for R 1 ϕ αA − R 1 φ αA * the same as that for the excess type II error R 1 ϕ αA − R 1 φ αA * in Zhao et al. (2016).
Theorem 12-Let α, δ 1 , δ 2 , δ 3 , δ 4 ∈ (0, 1), and l = |A|. In addition to the assumptions of Propositions 6 and 11, assume n 2 ≥ log 2 δ 4 2 , then we have with probability at least 1 − (δ 1 + for some positive constant C 2 that does not depend on A. Under smoothness, regularity, and sample size conditions, Theorem 12 shows the concentration of NPC αA around R 1 φ αA * with probability at least 1 − (δ 1 + δ 2 + δ 3 + δ 4 ). The user-specified violation rate δ 1 represents the uncertainty that the type I error of an NP classifier ϕ αA exceeds α, leading to the underestimation of R 1 φ αA * ; δ 2 accounts Li et al. Page 15 for the possibility of unnecessarily stringent control on the type I error, which results in the overestimation of R 1 φ αA * ; δ 3 accounts for the uncertainty in training the scoring function s A ( ⋅ ) on a finite sample; δ 4 represents the uncertainty of using leave-out class 1 observations S lo 1 to estimate R 1 ϕ αA . Note that while δ 1 is both an input parameter for the construction of s-NPC and a constraint on the sample sizes, the other parameters δ 2 , δ 3 , and δ 4 only have the latter role. Like the constant C 0 in Proposition 6, the generic constant C 2 in Theorem 12 can be provided more explicitly, but it would be too cumbersome to do so.
Note that the upper bound in Theorem 12 involves γ while that in Theorem 7 does not. This is expected as the detection condition (that involves γ) is a condition for diminishing excess type II errors under NP paradigm. Here we make some example simplifications to digest the bounds in Theorems 7 and 12. If we assume that m 1 , m 2 , n 1 , n 2 ~ N (the total sample size), β = 2, l = 1, and γ = 1, then the high probability upper bound for γ . Hence, when γ > 8, i.e., when there are not many points around the NP oracle decision boundary and thus the boundary is difficult to detect, the convergence rate of the upper bound is slower for NPC.
Although Theorems 7 and 12 both assume bounded supports in their conditions, we regard this as just a way to streamline the proofs. In Appendix F, we conduct a simulation study where features are generated from distributions with unbounded supports, and there is still clear concentration of CC A and NPC αA .
Remark 14-The conclusion in Theorem 13 also holds under sampling bias, i.e., when the sample sizes n (of class 1) and m (of class 0) do not reflect the population proportions π 0 and π 1 .
Here we offer some intuition about the the robustness of s-NPC against sampling bias. Note that the objective and constraint of the NP paradigm only involve the class-conditional feature distributions, not the class proportions. Hence, p-NPC does not rely on the class proportions. Furthermore, in s-NPC, each class-conditional density is estimated separately within each class, so s-NPC does not depend on the class proportions either. It is also worth noting that the proof of Theorem 13 (in Appendix) does not use the relation between the ratio of class sizes in the sample and that in the population.
Moreover, we derive partial consistency results for s-CC and s-NPC in Appendix B, where we show that if the top J feature subsets and the other feature subsets have p-CC or p-NPC differ by a margin, then s-CC or s-NPC can distinguish the top J feature subsets.

Simulation studies
This section contains six simulation studies to verify the practical performance of s-CC and s-NPC in ranking features and to compare s-CC and s-NPC against multiple commonly used criteria for marginal and joint feature ranking. Table 2 summarizes the designs and purposes of the six studies.
In detail, we verify the performance of s-CC and s-NPC in ranking features under lowdimensional settings with the class-conditional distributions as Gaussian (studies S1 & S3) or chi-squared (study S2), as well as under high-dimensional settings (studies S4-S5). Furthermore, in studies S3 and S5, we compare s-CC and s-NPC with three commonly used measures of feature importance in a multivariate classifier trained by the RF algorithm-the SHAP value (Lundberg and Lee, 2017) and two feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index). Moreover, we design study S6 to demonstrate the advantages of s-CC and s-NPC over four commonly used measures for marginal feature ranking-the Pearson correlation, the distance correlation (Székely et al., 2009), the two-sample t test, and the two-sample Wilcoxon rank-sum test. Besides, motivated by (Lin, 2002;Guyon et al., 2002), we implement variants of s-CC and s-NPC based on classifiers trained by the support vector machine (SVM) algorithm (Appendix C), and we show in study S3 that these variants are not robust to sampling biases, unlike s-NPC.
In all the simulation studies, we set the number of random splits B = 11 (which we show in Figure H.6 in Appendix as a reasonable choice) for s-CC and s-NPC, as well as their SVM variants, so that we can obtain reasonably stable criteria and meanwhile finish thousands of simulation runs within reasonable time. Regarding the RF algorithm, we use the randomForest() function in R package randomForest. The number of trees is set to ntree=500 by default.

Ranking low-dimensional features at the sample level
We first demonstrate the performance of s-CC and s-NPC in ranking features when d, the number of features, is much smaller than N (the total sample size). We design simulation studies S1 and S2 to support our theoretical results in Theorems 8 and 13 in the absence of sampling bias. Using simulation study S3, we demonstrate that s-NPC is robust to sampling bias, while s-CC and the SVM variants of s-CC and s-NPC are not; furthermore, we show that the RF algorithm's three feature ranking criteria (mean decrease in accuracy, mean decrease in Gini index, and SHAP value) cannot capture features' marginal ranking in the presence of feature correlations.
There is no sampling bias in simulation studies S1 and S2. In study S1, we generate data from the following two-class Gaussian model with d = 30 features, among which we set the first s = 10 features to be informative (a feature is informative if and only if it has different marginal distributions in the two classes).
We simulate 1000 samples of size N = 400 6 or 1000 from the above model. We apply s-CC (8) and s-NPC with δ 1 = .05 and four α levels .05, .10, .20, and .30 (11), five criteria in total, to each sample to rank the 30 features. That is, for each feature, we obtain 1000 ranks by each criterion. We summarize the average rank of each feature by each criterion in Tables  H.1 and H.2 in Appendix, and we plot the distribution of ranks of each feature in Figures 2  and 3. The results show that all criteria clearly distinguish the first 10 informative features from the rest. For s-NPC, we observe that its ranking is more variable for a smaller α (e.g., 0.05). This is expected because, when α becomes smaller, the threshold in the NP classifiers would have an inevitably larger variance and lead to a more variable type II error estimate, i.e., s-NPC. As the sample size N increases from 400 (Table H.1 and Figure 2) to 1000 (Table H.2 and Figure 3), all criteria achieve greater agreement with the true ranks.
In study S2, we generate data from the following two-class Chi-squared distributions of d = 30 features, among which we still set the first s = 10 features to be informative.
Similar to the previous Gaussian setting, the first 10 features have true ranks going down from 1 to 10, and the rest of features are tied with a true rank of 20.5. We simulate 1000 samples of size N = 400 or 1000 from this model, and we apply s-CC (8) and s-NPC with δ 1 5. Why the uninformative 20 features should receive a true rank of 20.5 instead of 11 is because a reasonable ranking criterion, with data randomness, would assign these 20 features with ranks uniformly distributed between 11 and 30. 6. In the NP umbrella algorithm, m 2 , i.e., the class 0 sample size reserved for estimating the threshold, must be at least 59 when α = δ 1 = .05. We set the overall sample size to N = 400 so that the expected m 2 is 100; then the realized m 2 is larger than 59 with high probability. = .05 and four α levels . 05, .10, .20, and .30 (11), five criteria in total, to each sample to rank the 30 features. We summarize the average rank of each feature by each criterion in Tables  H.3 and H.4 (in Appendix), and we plot the distribution of ranks of each feature in Figures  H.3 and H.4 (in Appendix). The results and conclusions are consistent with those under the Gaussian setting.
Next, we design study S3 with sampling bias, i.e., the two classes have proportions in the sample different from those in the population. We use the following Gaussian setting: X (Y = 0) N μ 0 , Σ , X (Y = 1) N μ 1 , Σ , π 1 population = ℙ population (Y = 1) = . 5, π 1 sample = ℙ sample (Y = 1) = . 1, that is, class 1 has a proportion .5 in the population but is undersampled with probability .1 in the sample. Same as in our first simulation study, we set independently and identically drawn from N(0, 1) and then held fixed, and we still set the diagonal entries of Σ to 4. What is different here is that we add a scenario with feature correlations: conditional on each class, features i and j have a correlation ρ ij = .9 |i−j| , i, j = 1, …, 30; that is, Σ is a Toeplitz-type matrix with the (i, j)-th entry equal to .9 |i−j| × 4.
Here features 1 to 10 still have their true ranks going down from 1 to 10, while the other features still have a tied true rank of 20.5. We simulate 1000 samples of size N = 1000 from this model, and we apply s-CC (8); s-NPC with δ 1 = .05 and four α levels .05, .10, .20, and .30 (11); their corresponding SVM variants (Appendix C); and the RF algorithm's feature importance measures (mean decrease in accuracy and mean decrease in Gini index) and SHAP value-13 criteria in total-to each sample to rank the 30 features. In Appendix, we summarize the average rank of each feature by each criterion in Table H.5 (for the uncorrelated-feature scenario) and Table H.6 (for the correlated-feature scenario), and we plot the distribution of ranks of each feature in Figure H.5 (for the uncorrelated-feature scenario) and Figure 4 (for the correlated-feature scenario). The results show that, in the presence of sampling bias, only s-NPC can distinguish the first 10 informative features from the rest, while s-CC and the SVM variants of s-CC and s-NPC cannot. These results highlight the unique robustness of s-NPC to sampling bias, an advantage that even its SVM variant does not embrace. The reason is that sampling bias affects the training of the SVM scoring function. Moreover, s-CC, s-NPC, and their SVM variants-all marginal feature ranking criteria-are unaffected by feature correlations, as expected. In contrast, the RF algorithm's three feature ranking criteria, which are based on multivariate classifiers and thus not marginal, cannot accurately capture the features' marginal ranking in the presence of feature correlations (Figure 4(c)).

Ranking high-dimensional features at the sample level
We next examine the performance of s-CC and s-NPC when d > N, using simulation studies S4 and S5. In study S4, we set d = 500 and N = 400. The generative model is the same as …, μ 500 independently and identically drawn from N(0, 1) and then held fixed, and Σ = 4I 500 . Same as in the low-dimensional settings (Section 5.1), p-CC and p-NPC have a clear gap between the first 10 informative features and the rest of features. In terms of both p-CC and p-NPC, the first 10 features have true ranks going down from 1 to 10, and the rest of features are tied with a true rank of 255.5. We simulate 1000 samples of size N = 400 and apply s-CC (8) and s-NPC with δ 1 = .05 and four α levels . 05, .10, .20, and .30 (11) to each sample to rank the 500 features. We summarize the average rank of each feature by each criterion in Table H.7 (in Appendix), and we plot the distribution of ranks of each feature in Figure 5. The results show that ranking under this high-dimensional setting is more difficult than under the low-dimensional setting. However, s-CC and s-NPC with α = 0.2 or 0.3 still clearly distinguish the first 10 informative features from the rest, while s-NPC with α = 0.05 or 0.1 have worse performance on features 8-10, demonstrating again that ranking becomes more difficult for s-NPC when α is smaller and requires a larger sample size N.
In the more high-dimensional study S5, we further decrease the N/d ratio by setting d = 10,000 and N = 200, a scenario that resembles many biomedical datasets.
The generative model is the same as (15), and μ 0 = (−2.5, …, − 2.5 10 , μ 11 , …, μ 10, 000 ) ⊤ , μ 1 = (1, . 9, …, . 2, . 1 10 , μ 11 , …, μ 10, 000 ) ⊤ , with μ 11 , …, μ 10,000 independently and identically drawn from N(0, 1) and then held fixed, and Σ = 4I 10,000 . Same as in study S4, the first 10 features have true ranks going down from 1 to 10, and the rest of features are tied with a true rank of 5005.5. We simulate 1000 samples of size N = 200, and we apply s-CC (8); s-NPC with δ 1 = .05 and three α levels .10, .20, and .30 (11); their SVM variants (Appendix C); and the RF's two feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index)-ten criteria in total-to each sample to rank the 10,000 features. Note that the SHAP value is inapplicable because it requires substantial computational time in this scenario. We summarize the average rank of each feature by each criterion in Table  H.8 (in Appendix). The results show that s-NPC outperforms its SVM variant at each α in terms of the ranking accuracy for the top 10 features. Moreover, five criteria, including s-CC, s-NPC (α = .30), their SVM variants, and RF's mean decrease in accuracy, can distinguish the first 10 informative features by assigning them average ranks no greater than 10. The fact that s-NPC and its SVM variant with α = .10 or .20 have worse performance is because of the small N/d ratio. This result echos the importance of using a not-too-small α for s-NPC when N is not too large and N/d is small. Interestingly, RF's mean decrease in Gini index performs worse than its mean decrease in accuracy in ranking the 10-th feature.

Comparison with other marginal feature ranking criteria
In simulation study S6, we compare s-CC and s-NPC with four criteria that have been widely used to rank features marginally: the Pearson correlation, the distance correlation (Székely et al., 2009), the two-sample t test, and the two-sample Wilcoxon rank-sum test.
To calculate p-CC and p-NPC with δ 1 = .05 at four α levels .05, .10, .20, and .30 on these two features, we use a large sample with size 10 6 for approximation, and the results in Table  3 show that all the five population-level criteria rank feature 2 as the top feature.
Then we simulate 1000 samples of size N = 400 from the above model and apply nine ranking approaches: s-CC, s-NPC with δ 1 = .05 at four α levels (.05, .10, .20, and .30), the Pearson correlation, the distance correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, to each sample to rank the two features. From this we obtain 1000 rank lists for each ranking approach, and we calculate the frequency that each approach correctly finds the true rank order. The frequencies are summarized in Table 4, which shows that none of the four common criteria identifies feature 2 as the better feature for prediction. In other words, if users wish to rank features based on a prediction objective under the classical or NP paradigm, these criteria are not suitable.

Real data applications
We apply s-CC and s-NPC to two real datasets to demonstrate their wide application potential in biomedical research. Here we set the number of random splits B = 1000 for s-CC and s-NPC, as allowed by our computational resource.

Application 1: classification of breast cancer and normal tissues based on genes' DNA methylation levels
The To facilitate the interpretation of our analysis results, we further process the data as follows. First, we discard a CpG probe if it is mapped to no gene or more than one genes. Second, if a gene contains multiple CpG probes, we calculate its methylation level as the average methylation level of these probes. This procedure leaves us with 19,363 genes with distinct methylation levels in every tissue. We consider the tissues as data points and the genes as features, so we have a sample with size N = 285 and number of features d = 19,363. Since misclassifying a patient with cancer to be healthy leads to more severe consequences than the other way around, we code the 239 breast cancer tissues as the class 0 and the 46 normal tissues as the class 1 to be aligned with the NP paradigm. After applying s-CC (8) and s-NPC with δ 1 = .05 and four α levels (.05, .10, .20, and .30) (11) to this sample, we summarize the top 10 genes found by each criterion in Table 5. Most of these top ranked genes have been reported associated with breast cancer, suggesting that our proposed criteria can indeed help researchers find meaningful features. Meanwhile, although other top ranked genes do not yet have experimental validation, they have weak literature indication and may serve as potentially interesting targets for cancer researchers. For a detailed list of literature evidence, please see the Supplementary Excel File. The fact that these five criteria find distinct sets of top genes is in line with our rationale that feature importance depends on prediction objective. By exploring top features found by each criterion, researchers will obtain a comprehensive collection of features that might be scientifically interesting.
Moreover, we apply the four widely-used but non-prediction-based marginal ranking criteria -the Pearson correlation, the distance correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, to rank the d = 19,363 genes. For demonstration purpose, we check the s-CC and s-NPC (α = .10) values of the genes ranked as the 1st, 101st, 201st, 301st, 401st, 501st, 601st, 701st, 801st, and 901st by each criterion. Figure 6 shows the cases where the ranks assigned by other criteria differ tremendously from those assigned by s-CC or s-NPC (α = .10), including the 301st and 801st genes ranked by the distance correlation, whose ranks by s-CC and s-NPC (α = .10) are much better, and the 601st and 701st genes ranked by the two-sample t test, whose ranks by s-CC and s-NPC (α = .10) are much worse. Figure 7 illustrates the class-conditional distributions of these four genes. Interestingly, the two genes NXPH1 and TSC22D4 are ranked top by s-CC (with ranks 18 and 77) and s-NPC (α = .10) (with ranks 99 and 146), while they are ranked worse by both the Pearson correlation (with ranks 1061 and 622) and the distance correlation (with ranks 801 and 301). Their class-conditional distributions show distinct, non-overlapping density peaks between classes 0 and 1, confirming their top ranks assigned by s-CC and s-NPC. Literature also suggests the two genes' potential roles in breast cancer: a methylation study found NXPH1 as a candidate biomarker gene for HER2+ breast cancer (Lindqvist et al., 2014); an RNA-seq study found TSC22D4 up-regulated in a BRCA1 mutated cell line (SL) compared to a BRCA1 wild-type cell line (SB) (Privat et al., 2018); another study identified TSC22D2, an isoform of TSC22D4, as a novel cancer-associated gene in a rare multi-cancer family (Liang et al., 2016).
In Figure 7, another two genes, COL16A1 and CUBN, are ranked much lower by s-CC (with ranks 3793 and 2687) and s-NPC (α = .10) (with ranks 3924 and 4792) than by the two-sample t test (with ranks 601 and 701). Inspecting the two genes' class-conditional distributions, we find that their class-1 conditional distributions have high-density domains largely contained in the high-density domains of their respective class-0 conditional distributions, an observation consistent with the low rankings assigned by s-CC and s-NPC (α = .10). Both genes do not seem to have direct associations with breast cancer: COL16A1 encodes the alpha chain of type XVI collagen; CUBN encodes a protein called cubilin, which is involved in the uptake of vitamin B12 from food into the body.

Application 2: classification of high-risk and low-risk prostate cancer patients based on microRNA expression levels in urine samples
The second dataset contains microRNA (miRNA) expression levels in urine samples of prostate cancer patients, downloaded from the GEO with accession number GSE86474 (Jeon et al., 2019). This dataset is composed of 78 high-risk and 61 low-risk patients. To align with the NP paradigm, we code the high-risk and low-risk patients as class 0 and 1, respectively, so m/n = 78/61. In our data pre-processing, we retain miRNAs that have at least 60% non-zero expression levels across the N = 139 patients, resulting in d = 112 features. We use this dataset to demonstrate that s-NPC is robust to sampling bias that results in disproportional training data; that is, training data have different class proportions from those of the population. Since in many biomedical datasets, the proportions of diseased patients do not reflect the true proportions in the population, a desirable feature ranking criterion should be robust to such sampling bias so that the selected features would maintain good out-of-sample predictive power.
We create two sub-datasets by randomly removing one half of the data points in class 0 or 1, so that one sub-dataset has m/n = 39/61 and the other has m/n = 78/31. We apply s-CC, s-NPC with δ 1 = .05, and the RF algorithm's feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index) to the full dataset and each sub-dataset to rank features. To evaluate each criterion's robustness to disproportional data, we compare its rank lists from these three datasets with different m/n ratios. For this comparison, we use the Kuncheva index (Kuncheva, 2007), which quantifies the overlap of top k feature sets and accounts for the overlap by chance.
where A k and B k are the top k features from two rank lists. The Kuncheva index has range [−1, 1], and its value is monotone increasing in |A k ∩ B k |. If A k and B k overlap by chance, the Kuncheva index would be close to 0. For more than two rank lists, their Kuncheva index for the top k features is defined as the average of their pairwise Kuncheva indices for the top k features. In our case, the larger the Kuncheva indices are for varying k, the more robust a criterion is to disproportional data. We illustrate the Kuncheva indices of s-CC, s-NPC, and the two RF feature importance measures in Figure 8, which shows that s-NPC is the most robust criterion.

Conclusion and perspectives
This work introduces model-free objective-based marginal feature ranking criteria-s-CC and s-NPC-for the purpose of binary decision-making. The explicit use of a prediction objective to rank features is demonstrated to select more predictive features than existing practices for marginal or multivariate feature ranking do. The reason is that commonly used marginal feature ranking criteria are association measures not reflecting the prediction objective or capturing certain data distributional characteristics, and that popular multivariate feature ranking criteria rely on multivariate classification algorithms such as SVM and RF, which are sensitive to sampling bias and feature correlations.
It is worth nothing that s-CC and s-NPC are extendable to multi-class classification. For s-NPC, as it is based on the Neyman-Pearson paradigm for binary classification, its extension to multi-class classification will rely on the one-vs-rest approach. Concretely, we single out the most important class as the class 0 and combine the rest of classes into the class 1, converting the multi-class problem into a binary classification problem; then s-NPC can be applied. For s-CC, with K classes, the classical oracle classifier is known to be φ*(·) = arg max k∈{1, …, K} π k p k (·), where π k = ℙ(Y = k) and p k (·) denotes the the class-k conditional density of feature(s). Then we can define s-CC as the average classification error of plug-in classifiers of this oracle, similar to equations (7) and (8) in our manuscript. We have implemented the multiple-class extensions of s-CC and s-NPC in our R package frc.
In addition to the illustrated CC and NP paradigms, the same marginal ranking idea extends to other prediction objectives such as the cost-sensitive learning and global paradigms. Another extension direction is to rank feature pairs in the same model-free fashion. In addition to the biomedical examples we show in this paper, model-free objective-based marginal feature ranking is also useful for finance applications, among others. For example, a loan company has successful business in region A and would like to establish new business in region B. To build a loan-eligibility model for region B, which has a much smaller fraction of eligible applicants than region A, the company may use the top ranked features by s-NPC in region A, thanks to the robustness of s-NPC to sampling bias.
Both s-CC and s-NPC involve sample splitting. The default option is a half-half split for both class 0 and class 1 observations. It remains an open question whether a refined splitting strategy may lead to a better ranking agreement between the sample-level and population-level criteria. Intuitively, there is a trade-off between classifier training and objective evaluation: using more data for training can result in a classifier closer to the oracle, while saving more data to evaluate the objective can lead to a less variable criterion.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
That is, each feature has the same class-conditional variance under the two classes. For α ∈ (0, 1), let φ α 1 * or φ α 2 * be the level-α NP oracle classifier using only the feature X {1} or X {2} respectively, and let φ 1 * or φ 2 where sign(·) is the sign function.

Appendix B.: Partial consistency results of s-CC and s-NPC
In this section, we present partial consistency results for s-CC (Theorem B.2) and s-NPC (Theorem B.3). The proofs are skipped due to the similarities to those of Theorem 8 and Theorem 13.
In addition to the assumptions in Theorem 12, assume that the p-NPC's separate the first J feature sets and the last H by some margin g > 0; in other words, In addition, assume m 1 , m 2 , n 1 , n 2 satisfy that where C 2 is the generic constant in Theorem 12. Then with probability at least 1 − (J + H)(δ 3 + δ 4 + δ 5 ), max i∈{1, …, J} NPC αAi < min i∈{J + 1, …, J + H} NPC αAi .

Appendix C.: Variants of s-CC and s-NPC based on SVM classifiers
In parallel to s-CC and s-NPC, which are defined based on plug-in classifiers whose scoring function is the ratio of kernel density estimates, we implement variants of s-CC and s-NPC based on SVM classifiers. Specifically, we change the scoring function s A  (8). Similarly, we construct the NP classifier (9) using the NP umbrella algorithm; then we define the s-NPC variant, called s-NPC-SVM, by following (11).

D.1 Proof of Lemma A.1
First we realize that the following three statements are equivalent:

1.
Feature importance ranking under the NP paradigm is invariant to α;

2.
Feature importance ranking under the classical paradigm is invariant to π 0 ;
We explore conditions for statement (1) to hold. We will divide our analysis into four scenarios (i)-(iv) regarding distribution means.
These oracle classifiers have type II errors The chain of equivalence holds
These oracle classifiers have type II errors Hence, Finally to sum up scenarios (i)-(iv), we conclude that
These combined with for some positive constant C that does not depend on the subset A.

Author Manuscript
Author Manuscript

D.4 Proof of Theorem 8
Theorem 7 and the sample size condition imply that for each i ∈ {1, …, J}, we have with probability at least 1 − (δ 3 + δ 4 + δ 5 ), Also considering the separation conditions that
Let 2e −2n 2 D 2 = δ 4 and then D = for some generic constant C. The thresholds C αA * 's are bounded from above by a single constant that does not depend on A; indeed, we can just take the upper bound to be μ max /μ min . Therefore, we have with probability at least 1 − δ 1 − δ 2 − δ 3 − δ 4 ,

Appendix E.: Time complexity of s-CC and s-NPC
The time complexity discussion is for one feature and B = 1.
For s-CC, its time complexity depends on the univariate kernel density estimation (KDE) and evaluation. In s-CC, we apply KDE to the m 1 observations in class 0 and the n 1 observations in class 1 to obtain the density estimates of the two class-conditional distributions; then we evaluate the two density estimates on the m 2 observations in class 0 and the n 2 observations in class 1 to obtain m 2 + n 2 density ratio estimates, which, together with the threshold m 1 /n 1 , gives the s-CC (Equation ( For s-NPC, in addition to the time complexity of KDE and evaluation, additional complexity arises from the threshold searching step in the NP umbrella algorithm (Tong et al., 2018), which involves sorting the m 2 density ratio estimates of the left-out class 0 observations and has time complexity O(m 2 log m 2 ).
In conclusion, when the training sample size is large, we can use approximate KDE methods to reduce the time complexity of s-CC to O(N), where N = m + n is the total sample size. For s-NPC, we can set an upper bound on m 2 so that the time complexity is also O(N).

Appendix F.: Relationships between numerical high-probability error bounds and sample sizes
In practice, we cannot realize the theoretical high-probability bounds on the differences between s-CC (or s-NPC) and p-CC (or p-NPC) due to unspecified constants (Theorem 7 and Theorem 12). To provide guidance for practitioners, here we investigate the relationships between the numerical, realized high-probability error bounds and sample sizes. We consider the following four features X {1} , X {2} , X {3} , X 4 ∈ ℝ, 7 whose classconditional distributions are the following Gaussians: X 4 | (Y = 0) N(0, 1), X 4 | (Y = 1) N(1, 1), (F.5) and the class priors are equal, i.e., π 0 = π 1 = .5. It can be calculated that the four features have the following p-CC and p-NPC values. We design a simulation study with five sample sizes N = 400, 600, 800, 1000, and 1200, i.e., the total number of observations from both classes. For each sample size, we simulate 5000 independent training datasets, and we calculate s-CC and s-NPC with α = .10, .20, and .30 for each feature on each dataset. Figure H.7 shows the trends of 80% percentiles (i.e., 80%-probability upper bounds) of absolute differences between sample-level criteria and their population counterparts for each feature and each N. As expected, all the upper Distributions of s-CC or s-NPC (with varying α) values of one feature, whose classconditional distributions are X | (Y = 0) N −1.5, 2 2 and X | (Y = 1) N 1, 2 2 and whose class prior is ℙ(Y = 1) = . 5, with varying B (number of random splits) and N (sample size) across 1000 independent simulations. Based on these distributions, B = 11 is a reasonable choice. 80% percentile (from 5000 independent simulations) of absolute differences between sample-level criteria and their population counterparts for four features (F.5) with varying sample size N. The decay rate is close to O(N −1/2 ).  and π 1 sample = . 1) (17) and without feature correlations-simulation study S3. Undesirable average ranks-i.e., the average ranks of the informative (first 10) features greater than 10 and the average ranks of the uninformative (latter 20) features smaller than 10-are underlined. Average ranks of the d = 30 features by s-CC, s-NPC (with varying α), their SVM variants, or the RF algorithm's feature importance measures and SHAP value, with sample size N = 1000 under the Gaussian setting with sampling bias (π 1 population = . 5 and π 1 sample = . 1) (17) and a Toeplitz-type feature covariance matrix: features i and j have a correlation ρ ij = .9 |i−j| , i, j = 1, …, 30-simulation study S3. Undesirable average ranks-i.e., the average ranks of the informative (first 10) features greater than 10 and the average ranks of the uninformative (latter 20) features smaller than 10-are underlined.    The frequency that each ranking approach identifies the true rank order.